- Why not linear regression?
- Logistic regression
- Multinomial logistic regression
- Evaluating accuracy
- Bayes theorem, LDA, QDA
- k Nearest Neighbours
9/22/2020
When the \(Y\) we are trying to predict is categorical (or qualitative), we say we have a classification problem.
Qualitative variables take values in an unordered set \(\mathcal{C}\), such as: \(\text{eye color} \in \{\text{brown, blue, green}\}, \text{email} \in \{\text{spam, not spam}\}\).
Given this probability, we can classify using a rule such as “guess Spam” if \(\mathbb{P}(Y = \text{Spam} \mid x) > 0.5\).
There are a large number of methods for classification.
We will start with binary classification, i.e. the case \(Y \in \{0, 1\} = \{\text{yes, no}\}\) and we want to find the relationship \[f(y = 1 \mid \mathbf{x}) = \mathbb{P}(Y = 1 \mid \mathbf{X} = \mathbf{x}).\]
This gives us also \(\mathbb{P}(Y = 0 \mid \mathbf{X}) = 1 - \mathbb{P}(Y = 1 \mid \mathbf{X})\).
Default data:

##   default student   balance    income
## 1      No      No  729.5265 44361.625
## 2      No     Yes  817.1804 12106.135
## 3      No      No 1073.5492 31767.139
## 4      No      No  529.2506 35704.494
## 5      No      No  785.6559 38463.496
## 6      No     Yes  919.5885  7491.559
Can we simply perform a linear regression of \(Y\) on \(X\) and classify as “Yes” if \(\hat{Y} > 0.5\)?
This is called the linear probability model: the probability of a “yes” outcome (\(y = 1\)) is linear in \(x_i\).
The problem is that the fitted values \(\hat{Y}\) from a linear regression are not constrained to lie in \([0, 1]\), so they cannot be interpreted as probabilities; for the Default data some of them are in fact negative.
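A minimal sketch of this check in R, assuming the Default data come from the ISLR package and that the Yes/No default variable is recoded as the 0/1 variable numeric_def (the name that appears in the glm output later in these notes):

```r
# Linear probability model: regress the 0/1 default indicator on balance.
library(ISLR)                                   # provides the Default data
Default$numeric_def <- ifelse(Default$default == "Yes", 1, 0)

lpm <- lm(numeric_def ~ balance, data = Default)
range(fitted(lpm))   # the lower end is negative, so these are not valid probabilities
```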
Now suppose we have a response variable with three possible values. A patient presents at the emergency room, and we must classify them according to their symptoms.
\[Y = \begin{cases} 1 & \text{if stroke} \\ 2 & \text{if drug overdose} \\ 3 & \text{if epileptic seizure} \end{cases}\]
This coding imposes an ordering and equal spacing between outcomes that the categories do not have, so a linear regression on \(Y\) is not appropriate (a different coding would give a different fit).
Logistic regression uses the power of linear modeling and estimates \(\mathbb{P}(Y = y \mid x)\) using a two-step process:

1. form a linear score \(\eta = \beta_0 + \beta_1 x\);
2. transform \(\eta\) into a probability with a function \(F\), so that \(\mathbb{P}(Y = 1 \mid x) = F(\eta)\).
\(F\) is called the link function; a standard choice is \[F(\eta) = \dfrac{e^\eta}{1 + e^\eta} = \dfrac{1}{1 + e^{-\eta}}\]
The key idea is that \(F(\eta)\) is always between \(0\) and \(1\) so we can use it as a probability. Note that \(F\) is increasing, so if \(\eta\) goes up \(\mathbb{P}(Y = 1 \mid x)\) goes up.
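A quick numerical check of this property (a small sketch; plogis() is base R's built-in logistic function):

```r
# F maps any real eta into (0, 1) and is increasing in eta.
F_logit <- function(eta) 1 / (1 + exp(-eta))
F_logit(c(-5, 0, 5))     # approximately 0.0067 0.5000 0.9933
plogis(c(-5, 0, 5))      # same values from the built-in logistic CDF
```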
This is called the “logistic” or “logit” link, and it leads to the logistic regression model \[\mathbb{P}(Y_i = 1 \mid x_i) = \dfrac{e^{\beta_0 + \beta_1 x_i}}{1 + e^{\beta_0 + \beta_1 x_i}}\]
This is a very common choice of link function, for a couple of good reasons. One is interpretability: a little algebra shows that
\[\log \left( \dfrac{p}{1-p} \right) = \beta_0 + \beta_1 x_i \Rightarrow \dfrac{p}{1-p} = e^{\beta_0 + \beta_1 x_i}\]
so that it is a log-linear model for the odds of a “yes” outcome.
\[\text{Odds} = \dfrac{p}{1-p} = e^{\beta_0} \cdot e^{\beta_1 x_i}\]
So \(e^{\beta_1}\) is an odds multiplier (or odds ratio) for a one-unit increase in the feature \(x\).
Say that \(\beta_1 = 1\), i.e. \(e^{\beta_1} \approx 2.72\).
If \(p = 0.2\), the odds are \(\dfrac{p}{1-p} = 0.25\). A one-unit increase in \(x\) multiplies the odds by \(e \approx 2.72\), giving odds of about \(0.68\) and a probability of \(p^{\prime} \approx 0.4\).
If \(p = 0.8\), the odds are \(\dfrac{p}{1-p} = 4\). A one-unit increase in \(x\) gives odds of about \(10.9\) and a probability of \(p^{\prime} \approx 0.92\).
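The same arithmetic as a small R sketch (the helper odds_update() is just for illustration):

```r
# Multiply the odds by exp(beta1), then convert back to a probability.
odds_update <- function(p, beta1 = 1) {
  odds_new <- (p / (1 - p)) * exp(beta1)
  odds_new / (1 + odds_new)
}
odds_update(0.2)   # approximately 0.40
odds_update(0.8)   # approximately 0.92
```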
Logistic regression gives us a formal parametric statistical model (like linear regression with normal errors)
\[Y_i \sim \text{Bernoulli}(p_i), \quad p_i = F(\beta_0 + \beta_1 x_i)\]
To estimate the parameters \(\beta_0\) and \(\beta_1\), we use maximum likelihood. That is, we choose the parameter values that make the data we have seen most likely.
\[L(\beta_0, \beta_1) = \prod_{i =1}^n \left\{ p_i^{y_i} (1 - p_i)^{1 - y_i} \right\}\]
This likelihood gives the probability of the observed zeros and ones in the data. In R we use the `glm` function.
Maximum likelihood is a method for determining the optimal values of a model's parameters: the parameter values are chosen to maximize the likelihood that the model produced the data that were actually observed.
Which curve was most likely responsible for creating the data points that we observed?
The same principle is used for linear regression (least squares gives the maximum likelihood estimates under normal errors), and the same principle applies here.
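The fit behind the summary below can be reproduced with a call like this (using the numeric_def recoding introduced earlier):

```r
# Logistic regression of default (0/1) on balance, fit by maximum likelihood;
# family = "binomial" uses the logit link by default.
fit_balance <- glm(numeric_def ~ balance, family = "binomial", data = Default)
summary(fit_balance)
exp(coef(fit_balance)["balance"])   # odds multiplier for a one-unit increase in balance
```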
## 
## Call:
## glm(formula = numeric_def ~ balance, family = "binomial", data = Default)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2697  -0.1465  -0.0589  -0.0221   3.7589  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.065e+01  3.612e-01  -29.49   <2e-16 ***
## balance      5.499e-03  2.204e-04   24.95   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2920.6  on 9999  degrees of freedom
## Residual deviance: 1596.5  on 9998  degrees of freedom
## AIC: 1600.5
## 
## Number of Fisher Scoring iterations: 8
The deviance is (up to a constant) \(-2\) times the maximized log-likelihood, so a big likelihood is good and a small deviance is good.
When performing model selection, the deviance can be used as an out-of-sample loss function (just as we use the sum of squared errors for a numeric \(Y\)).
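A small check of the likelihood/deviance relationship for the fit above (for ungrouped 0/1 data the residual deviance is exactly \(-2\) times the maximized log-likelihood):

```r
deviance(fit_balance)                  # should match the residual deviance in the summary
-2 * as.numeric(logLik(fit_balance))   # same number
```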
We can extend our logistic model to several predictors by letting \(\eta\) be a linear combination of the \(x\)’s instead of a linear function of a single \(x\).
\[\log \left( \dfrac{p(\mathbf{X})}{1 - p(\mathbf{X})} \right) = \beta_0 + \beta_1 x_1 + \dots + \beta_p x_p\] \[ p(\mathbf{X}) = \dfrac{e^{\beta_0 + \beta_1 x_1 + \dots + \beta_p x_p}}{1 + e^{\beta_0 + \beta_1 x_1 + \dots + \beta_p x_p}}\]
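The multiple logistic regression reported below is fit the same way, just with more terms on the right-hand side of the formula:

```r
# Default on balance and the student dummy (R creates studentYes automatically).
fit_multi <- glm(numeric_def ~ balance + student, family = "binomial",
                 data = Default)
summary(fit_multi)
```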
## 
## Call:
## glm(formula = numeric_def ~ balance + student, family = "binomial", 
##     data = Default)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.4578  -0.1422  -0.0559  -0.0203   3.7435  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.075e+01  3.692e-01 -29.116  < 2e-16 ***
## balance      5.738e-03  2.318e-04  24.750  < 2e-16 ***
## studentYes  -7.149e-01  1.475e-01  -4.846 1.26e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2920.6  on 9999  degrees of freedom
## Residual deviance: 1571.7  on 9997  degrees of freedom
## AIC: 1577.7
## 
## Number of Fisher Scoring iterations: 8
The probability of default increases with the balance, but at any fixed balance a student is less likely to default.
## 
## Call:
## glm(formula = numeric_def ~ student, family = "binomial", data = Default)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.2970  -0.2970  -0.2434  -0.2434   2.6585  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -3.50413    0.07071  -49.55  < 2e-16 ***
## studentYes   0.40489    0.11502    3.52 0.000431 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2920.6  on 9999  degrees of freedom
## Residual deviance: 2908.7  on 9998  degrees of freedom
## AIC: 2912.7
## 
## Number of Fisher Scoring iterations: 6
Here the coefficient for the student dummy is positive, suggesting that a student is more likely to default.
## 
## Call:
## glm(formula = numeric_def ~ student + balance, family = "binomial", 
##     data = Default)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.4578  -0.1422  -0.0559  -0.0203   3.7435  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.075e+01  3.692e-01 -29.116  < 2e-16 ***
## studentYes  -7.149e-01  1.475e-01  -4.846 1.26e-06 ***
## balance      5.738e-03  2.318e-04  24.750  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2920.6  on 9999  degrees of freedom
## Residual deviance: 1571.7  on 9997  degrees of freedom
## AIC: 1577.7
## 
## Number of Fisher Scoring iterations: 8
But, in the multiple logistic regression, the coefficient for student is negative (students are less likely to default at any fixed level of balance)
Same intuition as in multiple linear regression: we know that when the \(x\)’s are correlated, the coefficients of the old \(x\)’s can change when we add new \(x\)’s to the model.
Are balance and student “correlated”?
Multiple logistic regression can tease this out: students tend to carry higher balances, and higher balances are strongly associated with default, so marginally students look riskier even though, at any fixed balance, they are less likely to default.
Also referred to as multinomial regression.
So far we have discussed logistic regression with two classes. It is easily generalized to more than two classes. One version (used in the R package `glmnet`) has the symmetric form
\[\mathbb{P}(Y = k \mid X = x) = \dfrac{e^{\beta_{0k} + \beta_{1k} x_{1} + \dots + \beta_{pk} x_p}}{\sum_{l=1}^K e^{\beta_{0l} + \beta_{1l} x_{1} + \dots + \beta_{pl} x_p}}\]
Here there is a linear function for each class. Note that some cancellation is possible, so only \(K - 1\) linear functions are strictly needed (in the two-class case we used a single one).
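As a sketch, the symmetric form can be written directly as a small function; the coefficient matrix B below is hypothetical and only illustrates the formula (glmnet estimates such coefficients when called with family = "multinomial"):

```r
# Column k of B holds (beta_0k, beta_1k, ..., beta_pk); return P(Y = k | x) for all k.
softmax_prob <- function(B, x) {
  eta <- as.vector(t(B) %*% c(1, x))   # one linear score per class
  exp(eta) / sum(exp(eta))             # softmax: probabilities sum to 1
}
B <- cbind(c(0.5, -1), c(0, 0.3), c(-0.2, 0.7))   # 3 hypothetical classes, 1 feature
softmax_prob(B, x = 2)
```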
Confusion matrix:
##     true
## pred    0    1
##    0 9628  228
##    1   39  105
Counts on the diagonal are successes, while the off-diagonal counts represent the two kinds of failures you can make: false positives (predicting \(1\) when the truth is \(0\)) and false negatives (predicting \(0\) when the truth is \(1\)).
Total accuracy: the sum of the elements on the diagonal divided by the total; here \((9628 + 105)/10000 = 0.9733\).
“Positive” and “negative” refer to the outcome predicted by the model.
We produced this table by classifying to class “Yes” if \[\mathbb{P}(\text{Default = Yes} \mid \text{Balance, Student}) \geq 0.5\]
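A minimal sketch of how a table like this is produced, using the multiple logistic fit from above (the exact counts depend on the fit used):

```r
p_hat <- predict(fit_multi, type = "response")   # P(Default = Yes | Balance, Student)
pred  <- ifelse(p_hat >= 0.5, 1, 0)              # classify with threshold 0.5
tab   <- table(pred = pred, true = Default$numeric_def)
tab
sum(diag(tab)) / sum(tab)                        # total accuracy
```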
We can change the two error rates by replacing the threshold \(0.5\) with some other value \(s \in [0, 1]\), classifying to “Yes” if \[\mathbb{P}(\text{Default = Yes} \mid \text{Balance, Student}) \geq s.\]
In order to reduce the false negative rate, we may want to reduce the threshold to \(0.1\) or less.
ROC (receiver operating characteristic) curve: plots the true positive rate against the false positive rate as the threshold \(s\) varies.
The ROC plot displays both simultaneously. Sometimes we use the AUC or area under the curve to summarize the overall performance (higher AUC is good).
The diagonal line \(y = x\) represents the performance of a classifier whose predictions carry no information about the outcome, i.e. if \(Y\) were independent of \(X\); a useful classifier lies above it.
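A bare-bones way to trace out the ROC curve by hand, reusing p_hat from the confusion-matrix sketch above (dedicated packages such as pROC do this too; this is just to show the mechanics):

```r
# For each threshold s, compute the true positive rate and false positive rate
# of the rule "predict Yes if p_hat >= s".
s_grid <- seq(0, 1, by = 0.01)
tpr <- sapply(s_grid, function(s) mean(p_hat[Default$numeric_def == 1] >= s))
fpr <- sapply(s_grid, function(s) mean(p_hat[Default$numeric_def == 0] >= s))

plot(fpr, tpr, type = "l", xlab = "False positive rate",
     ylab = "True positive rate", main = "ROC curve")
abline(0, 1, lty = 2)   # the "no information" diagonal

# Trapezoid-rule approximation of the AUC (higher is better):
-sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)
```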